A GRAPH THEORETIC TECHNIQUE FOR THE GENERATION OF SYSTOLIC
IMPLEMENTATIONS FOR SHIFT-INVARIANT FLOW GRAPHS


D. A. Schwartz and T. P. Barnwell III


Abstract 

This paper presents a general method for the transformation
of algorithms described by shift-invariant fully-specified
flow graphs into equivalent systolic realizations. The method
consists of a set of rules for the systematic manipulation of
the flow graphs into systolic form utilizing a set of theorems
from graph theory. It is shown that many of the previously
published systolic algorithms and many new algorithms can be
generated using this single procedure.

Introduction 


The fundamental goal of this research is to develop methods
for the automatic and optimal realization of a large class of
Digital Signal Processing (DSP) algorithms on synchronous
multiprocessors composed of multiple, identical programmable
processors. This research seeks to find the most efficient
possible solutions, in which the intrinsic synchrony of the
system maintains the data precedence relations, and in which
no cycles of any of the processors are used for system
control (1). DSP algorithms, as a class, are uniquely well
suited to this approach both because of their computational
intensity and because of their high level of internal
structure.

One of the popular methods which might well be used to
accomplish this goal is the application of systolic arrays to
DSP implementations. Over the past several years, there has
been considerable interest in systolic implementations, and a
number of systolic algorithms have been published. For the
most part, these algorithms have not been derived by any
formal method, but rather have been developed and presented
separately, sometimes without extensive verification. The
purpose of this paper is to present a simply applied set of
techniques which allow for the systematic derivation of
systolic solutions for a large and interesting class of
algorithms.

Flow Graph Representation

In this research, the algorithms to be implemented are all
described using fully-specified flow graph representations.
As is illustrated in Fig. 1, a fully-specified flow graph is a
directed graph in which all operations occur at the nodes, and
in which the branches are used exclusively as signal paths. In
addition, the graph is constrained so that its node operations
are each fundamental operations of the constituent processors
which are to be used in the implementation. More precisely,
the node operations represent the granularity with which the
parallelism of the flow graph can be manipulated, and they
should be chosen accordingly. The fully-specified flow graph
is a very powerful representation which, if properly applied,
is not only capable of describing such traditional signal flow
graph structures as digital filters and fast transforms, but
also such nonlinear structures as those involving decimation,
interpolation, homomorphic processing, and a large class of
matrix operations. In addition, by allowing the nodes to be
low level logic operations, these flow graphs can also
describe bit-serial, byte-serial, and many other distributed
arithmetic structures.


Flow Graph Bounds

Given that only one processor type is to be used in the
eventual multiprocessor implementation and given that the
characteristics of this constituent processor are known, then
it is possible to compute bounds on the synchronous
multiprocessor realization of a fully specified flow graph.
Two bounds are of particular interest. The first bound, called
the sample period bound, involves the minimum sampling period
at which a particular algorithm can be implemented using a
particular constituent processor. The sample period bound is
best understood in the context of a recursive
single-time-index flow graph (such as an IIR digital filter),
although the concept is also meaningful in systems which have
no explicit sampling period. For such systems, the sample
period bound is given by


$$T_s = \max_{p} \left[ \frac{d_p}{n_p} \right]$$

where p varies over the set of all loops in the flow graph,
$d_p$ is the arithmetic delay in loop p, and $n_p$ is the
number of unit delay nodes in loop p. This result is a
generalization of a result published by Renfors and Neuvo (2).
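
As an illustration of the sample period bound, the following minimal Python sketch enumerates the loops of a small flow graph and takes the largest ratio of arithmetic delay to unit-delay count. The graph encoding (`succ`, `op_time`, `delay_nodes`), the node names, and the operation times are hypothetical and chosen only for this example; the brute-force loop search is adequate only for small graphs.

```python
from fractions import Fraction

def simple_cycles(succ):
    """Enumerate the simple loops of a small directed graph {node: [successors]}."""
    cycles = []
    for start in sorted(succ):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in succ.get(node, []):
                if nxt == start:
                    cycles.append(path)  # closed a loop back to its smallest node
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))
    return cycles

def sample_period_bound(succ, op_time, delay_nodes):
    """Maximum over all loops p of d_p / n_p (assumed encoding, see lead-in)."""
    bounds = []
    for loop in simple_cycles(succ):
        d_p = sum(op_time[n] for n in loop if n not in delay_nodes)
        n_p = sum(1 for n in loop if n in delay_nodes)
        if n_p == 0:
            raise ValueError("a loop without a delay node is not computable")
        bounds.append(Fraction(d_p, n_p))
    return max(bounds, default=Fraction(0))

# Hypothetical second-order recursive graph: two loops through one adder.
succ = {"add": ["z1"], "z1": ["mul1", "z2"], "mul1": ["add"],
        "z2": ["mul2"], "mul2": ["add"]}
op_time = {"add": 1, "mul1": 2, "mul2": 2}   # assumed operation times
delay_nodes = {"z1", "z2"}

bound = sample_period_bound(succ, op_time, delay_nodes)
print(bound)
```

Here the inner loop (one delay, arithmetic delay 3) dominates the outer loop (two delays, arithmetic delay 3), so the inner loop sets the bound.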

The second bound, called the static-pipeline sample period
bound, is of particular interest in the derivation of systolic
implementations. This bound is the minimum sampling period
which can be achieved if the entire graph is implemented using
a static pipeline. A static pipeline is an implementation in
which the node operations of the graph are explicitly assigned
to individual processors, and in which every application of
any particular node operation is always performed by the same
processor. A static pipeline realization for a flow graph can
be considered to be a complete partitioning of the flow graph
in space, in which each node in the flow graph is assigned to
a particular processor. This is to be contrasted with
cyclo-static implementations such as SSIMD (1) in which
different time-index (or space-index) applications of a
particular node operation may be realized by different
processors at different times. In general, cyclo-static
implementations may achieve the sample period bound while
static-pipeline implementations can only achieve the
static-pipeline sample period bound.
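
To make the static-pipeline bound for a single loop concrete, the sketch below places a loop's delay nodes on the branches between its operations so that the longest run of operation time between two consecutive delays is as small as possible, and then takes the maximum over loops. This is a brute-force illustration under assumed inputs (a list of operation times in loop order and a delay count); it treats each loop in isolation and so ignores conflicts between coupled, overlapping loops.

```python
from itertools import combinations

def static_pipeline_loop_bound(op_times, n_delays):
    """Min over delay placements of the max operation time between delays.

    op_times: operation times in order around the loop (hypothetical values).
    n_delays: number of delay nodes available in the loop (at least 1).
    """
    m = len(op_times)
    if n_delays < 1:
        raise ValueError("a loop needs at least one delay node")
    n = min(n_delays, m)  # extra delays beyond one per branch do not help
    best = None
    for cuts in combinations(range(m), n):  # cut i sits after operation i
        worst = 0
        for k in range(n):
            i = (cuts[k - 1] + 1) % m       # first operation after previous delay
            total = op_times[i]
            while i != cuts[k]:
                i = (i + 1) % m
                total += op_times[i]
            worst = max(worst, total)
        best = worst if best is None else min(best, worst)
    return best

def static_pipeline_bound(loops):
    """Graph bound: the maximum of the individual loop bounds."""
    return max(static_pipeline_loop_bound(ops, n) for ops, n in loops)

# Hypothetical loop with three operations of times 3, 1 and 2.
print(static_pipeline_loop_bound([3, 1, 2], 1))
print(static_pipeline_loop_bound([3, 1, 2], 2))
print(static_pipeline_bound([([3, 1, 2], 2), ([4, 4], 2)]))
```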


data streams is essentially equivalent to a single system for
which the N sets of inputs and outputs have been separately
interleaved as ordered sets and the order of all the delay
nodes in the flow graph has been multiplied by N.

COROLLARY: A shift invariant system is always essentially
equivalent to a shift invariant system where the input has
been up-sampled by N, the output has been down-sampled by N,
and the order of all the delay nodes has been multiplied by N.
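
The interleaving result and its corollary can be checked numerically on a small case. The sketch below uses a hypothetical first-order recursive system, y[n] = x[n] + a*y[n-D] with zero initial state, as a stand-in for a general shift-invariant flow graph; the function names, the coefficient, and the test sequences are assumptions made for illustration only.

```python
def recursive_filter(x, a, delay=1):
    """y[n] = x[n] + a * y[n - delay], with zero initial state."""
    y = []
    for n, xn in enumerate(x):
        feedback = y[n - delay] if n >= delay else 0
        y.append(xn + a * feedback)
    return y

def interleave(streams):
    """Merge equal-length streams sample by sample: s0[0], s1[0], s0[1], ..."""
    return [s[n] for n in range(len(streams[0])) for s in streams]

def deinterleave(seq, count):
    """Split an interleaved sequence back into `count` streams."""
    return [seq[k::count] for k in range(count)]

def upsample(x, count):
    """Insert count - 1 zeros after every sample."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0] * (count - 1))
    return out

a = 3                       # hypothetical integer coefficient (exact arithmetic)
x0 = [1, 0, 2, -1, 4]
x1 = [5, 1, 0, 0, -2]

# Two separate systems versus one system with every delay order doubled.
separate = [recursive_filter(x0, a), recursive_filter(x1, a)]
shared = deinterleave(recursive_filter(interleave([x0, x1]), a, delay=2), 2)
print(separate == shared)

# Corollary: up-sample by 3, triple the delay order, down-sample by 3.
print(recursive_filter(upsample(x0, 3), a, delay=3)[::3] == recursive_filter(x0, a))
```

Integer data is used so that the comparison is exact rather than subject to floating-point rounding.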

A Nodal Cutset is defined as that set of branches which are
cut when a closed surface is constructed inside a flow graph
in such a way that it passes through no nodes.


Like the sampling period bound, a static-pipeline sample
period bound is first computed for each loop, and then the
overall bound for the graph is computed as the maximum of the
individual loop bounds. Each loop individually can be thought
of as consisting of a set of operation nodes and a set of
delay nodes. The static pipeline bound for a loop is computed
by assuming that the delay elements may be distributed
throughout the loop in any desired configuration and by
finding that distribution of delays which minimizes the
maximum operation time between two consecutive delay elements.
This mini-max operation time is the static-pipeline sample
period bound for the loop and the maximum value of all such
loop bounds is the static-pipeline sample period bound for the
flow graph, and can be written as

Optimality

This work makes use of two separate definitions of
optimality. An implementation is said to be rate optimal if it
achieves the sampling period bound and is said to be processor
optimal if it exhibits perfect processor efficiency so that
every cycle of every processor is used directly on the
fundamental operations of the algorithm and no cycles are used
for synchronization or system control. Clearly, these two
definitions of optimality are non-exclusive, and any
particular implementation may satisfy any combination of these
optimality criteria.

Rigorous  Systolic  Derivations 

Two single-time-index systems are said to be essentially
equivalent if, given the same input sequences, they always
give the same output sequences. The systolic derivation
procedure is based on two theorems concerning the essential
equivalence of systems described by flow graphs.

THE DATA INTERLEAVE THEOREM: A set of N identical shift
invariant systems operating on N separate


THE CUTSET DELAY TRANSFORMATION THEOREM: Any shift invariant
flow graph is essentially equivalent to a flow graph which is
formed by adding ideal delay (advance) nodes to all the input
branches in a nodal cutset and adding ideal advance (delay)
nodes to all the output branches in the same nodal cutset.
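
A minimal sketch of this transformation, assuming a hypothetical encoding in which each branch `(source, destination)` carries an integer count of ideal delays (a negative count would denote a net advance), is shown below. The closed surface is represented simply by the set of nodes it encloses, and parallel branches between the same pair of nodes are not modelled.

```python
def cutset_delay_transform(branches, inside, k=1):
    """Apply the cutset delay transformation to a branch-delay map.

    branches: {(src, dst): delay_count}  (assumed encoding)
    inside:   nodes enclosed by the closed surface defining the nodal cutset
    k:        delays added to branches entering the surface; the same number
              of advances is added to branches leaving it (k < 0 swaps roles)
    """
    result = {}
    for (src, dst), delays in branches.items():
        if src not in inside and dst in inside:
            delays += k          # input branch of the cutset
        elif src in inside and dst not in inside:
            delays -= k          # output branch of the cutset
        result[(src, dst)] = delays
    return result

# Hypothetical three-node loop with both delays lumped on one branch.
branches = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}

# Enclose node "a": one advance on its input, one delay on its output.
moved = cutset_delay_transform(branches, {"a"}, k=-1)
print(moved)
print(sum(moved.values()) == sum(branches.values()))
```

Because every loop crosses a closed surface inward as often as outward, the number of delays around each loop is unchanged; the transformation only redistributes them.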

A fundamental constraint placed on systolic arrays in their
definition (3) is that the transfer of data between cells must
be simultaneous. This translates into a flow graph constraint
that every output branch from a cell must be terminated by a
delay node (pipeline register). Hence, the generation of
systolic solutions for flow graphs reduces to the distribution
of the delay nodes throughout the flow graph so that this
condition is met. In this procedure, the static-pipeline
sample period bounds for the individual loops in the flow
graph are used to determine where the delay nodes should be
redistributed, the required interleaving factor, and the
appropriate nodal cutsets. If the sample period is known, it
is always simple to introduce delay nodes into the
nonrecursive portions of the graph such that the maximum delay
between pipeline registers is less than or equal to the sample
period.
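
The systolic condition on cell outputs lends itself to a mechanical check. The sketch below assumes a hypothetical encoding in which each branch carries a delay count and each node is assigned to a cell; it reports every branch that crosses a cell boundary without at least one delay node.

```python
def systolic_violations(branches, cell_of):
    """Return the inter-cell branches that carry no delay node.

    branches: {(src, dst): delay_count}  (assumed encoding)
    cell_of:  {node: cell identifier}    (assumed partition into cells)
    """
    return sorted(
        (src, dst)
        for (src, dst), delays in branches.items()
        if cell_of[src] != cell_of[dst] and delays < 1
    )

# Hypothetical two-cell partition of a small recursive graph.
cell_of = {"m1": 0, "a1": 0, "m2": 1}
before = {("m1", "a1"): 0, ("a1", "m2"): 0, ("m2", "a1"): 2}
after = {("m1", "a1"): 0, ("a1", "m2"): 1, ("m2", "a1"): 1}

print(systolic_violations(before, cell_of))
print(systolic_violations(after, cell_of))
```

In the `before` graph both loop delays sit on the return branch, so the forward branch between cells violates the condition; redistributing one delay, as in `after`, removes the violation while keeping two delays in the loop.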

The way in which these theorems are used to derive a systolic
implementation from a fully specified flow graph is
illustrated in Fig. 1. Other examples of the derivation of
systolic implementations from flow graphs are given in (1). In
particular, Fig. 1 illustrates the generation of the systolic
implementation for the two multiplier Markel-Gray lattice
filter. This example has been chosen for four specific
reasons. First, and most important, it was chosen because it
uses all of the theorems and exhibits all of the properties
which are typical of the derivation of systolic
implementations from flow graphs. Second, it illustrates
clearly the central role of the Data Interleave Theorem in
finding systolic derivations for recursive systems. Third, in
spite of being typical in the ways the theorems are applied,
this particular derivation itself has some additional
interesting and surprising features which are worthy of note.
Finally, the systolic implementations presented here for this
important digital filter structure have not been presented
before, and are interesting in their own right.

From Fig. 1a, it is clear that the sample period bound for
this form of the lattice filter is given by
$T_s = t_m + 2t_a$, where $t_m$ and $t_a$ are the multiply and
add times. Also, since this filter is recursive, it is clear
that there must be a bidirectional data flow between the
systolic cells. The problem is that, in its original form,
there are not enough delay nodes inside the loops to meet the
systolic requirement for positioning delay elements at all
output branches (it should be obvious here that the Cutset
Delay Transformation Theorem will always conserve the number
of delay elements in a loop, although it will allow them to be
easily redistributed). As a result, an interleaving
transformation is required to generate the needed delay nodes.
Fig. 1b shows the filter after a 2-way interleaving
transformation. Computing the static-pipeline bound for this
flow graph results in a sampling period of
$T_p = t_m + 2t_a$. However, for this case, this is not an
achievable bound since the lattice filter has coupled,
overlapping loops. It is hence necessary to arrange the delays
in each loop so they do not conflict with the delay
requirements of other loops. With this added requirement, the
achievable pipeline sample period bound, $T_a$, is given by
$T_a = 2t_m + 2t_a$. With knowledge of the sample period
bounds, the cutsets are easily constructed as shown in
Fig. 1b. The resulting network is shown in Fig. 1c. This
network is now in systolic form. The corresponding systolic
cell interconnection details are shown in Fig. 1d, where the
last cell (Type II) is a degenerate form of the other cells
(Type I). When the 2-way interleaving is considered, this
implementation can achieve a sampling period of
$T = 4t_m + 4t_a$. This is well short of the sample period
bound, and when no second signal is available to be
interleaved, this represents a 50% processor efficiency.


Another important point is illustrated in Fig. 1e. The point
is that there exists a systolic form which has a smaller
sample period than the system of Fig. 1c. This lower sampling
period is achieved by applying a 3-way interleaved
transformation, which, after the appropriate cutset
transformations have been applied, results in the systolic
network of Fig. 1e. In Fig. 1e, the dashed lines partition the
network into systolic cells. This new network has a
static-pipeline sample period bound of
$T_p = T_a = t_m + t_a$. With the interleaving considered,
this gives an achievable sampling period of
$T = 3t_m + 3t_a$. While this is 4/3 times faster than the
2-way interleaved form, it still does not achieve the optimal
sample period bound, so it is not rate-optimal. For this
particular network, the optimal sample period bound cannot be
achieved with a systolic implementation, nor will any higher
interleaving factor result in a faster processing rate.
However, for any given network it is clear that there is an
easily determined optimum interleaving factor that achieves
the minimum possible static-pipeline sample period bound, and
for some networks, this is equal to the sample period bound.
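
The comparison between the two interleaved forms is simple arithmetic, sketched below for illustration. The expressions are taken from the lattice-filter discussion as read here (sample period bound t_m + 2 t_a, 2-way form at 2 (2 t_m + 2 t_a), 3-way form at 3 (t_m + t_a)); the function name and the numerical operation times are hypothetical.

```python
from fractions import Fraction

def lattice_periods(t_m, t_a):
    """Per-stream sampling periods for the lattice example (assumed reading)."""
    return {
        "sample_period_bound": t_m + 2 * t_a,
        "two_way_interleaved": 2 * (2 * t_m + 2 * t_a),
        "three_way_interleaved": 3 * (t_m + t_a),
    }

periods = lattice_periods(t_m=2, t_a=1)   # hypothetical operation times
speedup = Fraction(periods["two_way_interleaved"],
                   periods["three_way_interleaved"])
print(periods)
print(speedup)
```

The ratio of the two interleaved periods is 4 (t_m + t_a) / 3 (t_m + t_a) = 4/3 regardless of the operation times, while both periods remain above the sample period bound.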


Optimal Pipeline Solutions


are rigorous in the sense that each step is achieved by the
application of a theorem from graph theory to the flow graph,
and so long as the theorems are properly applied, the
resulting implementation is guaranteed to be correct. This
method is simple to apply (in fact, it can be largely
automated), and should be of great utility in the study of
systolic algorithms. It also serves to gain further insight
into the fundamental nature of systolic algorithms. For
example, the 50% efficiency so often found in systolic
implementations can be seen to be a result of the required
2-way interleave transformation for recursive systems (1/N
efficiency for an N-way interleaved system with one data
stream). However, a point which is made clear in this context
which is not clear from the original systolic presentations is
that 100% efficiency can easily be obtained if two independent
data streams are available to be processed simultaneously.

The real problem is that traditional systolic approaches do
not necessarily lead to the best synchronous multiprocessor
implementations. A fundamental reason for this is that the
basic systolic definition requires that all of the data
transfers between cells must be done simultaneously. This can
lead to both very low processor efficiency and a low
achievable sampling rate. It is simple to understand why this
is true. The global systolic clock leads to a succession of
information wavefronts separated by one global clock period.
The global clock period clearly must be greater than or equal
to the largest cell processing delay in the system and also,
as illustrated above, recursive systems require interleaving
transformations which lead to the introduction of extra,
"padding" wavefronts. Therefore if all the cell processing
delays are not of equal duration or if padding wavefronts are
present, then the processors are not always doing useful work.
If the global systolic clock constraint is relaxed, then the
information wavefronts may be separated by less than the
maximum cell processing delay, and much more efficient
implementations are possible. SSIMD is a perfect illustration
of this effect (1). In SSIMD, the processing delay of each
cell is typically quite long (since each cell generally
includes all the operations in one iteration of the flow
graph), but the information wavefronts are separated by only a
small fraction of a cell processing delay. This extra
flexibility is one reason why SSIMD and other cyclo-static
implementations can achieve processor optimal and rate optimal
solutions when systolic implementations for the same flow
graph cannot (1).

Fig. 1.

a) Fully specified signal flow graph of a third order
Markel-Gray lattice filter.

b) Flow graph after a 2-way interleaved transformation. Dashed
lines indicate desired cutsets for delay transformations.

c) Systolic form of 2-way interleaved lattice filter, after
cutset delay transformations. Dashed lines indicate systolic
cell partition.

d) Detail of cell interconnections.

e) Systolic form of 3-way interleaved lattice filter. Dashed
lines indicate systolic cell partitions.


